Impact of Local Data Characteristics on Learning Rules from Imbalanced Data

Author

  • Jerzy Stefanowski
Abstract

In this paper we discuss improving rule-based classifiers learned from class-imbalanced data. Standard learning methods often do not work properly with imbalanced data, as they are biased towards the majority classes while "disregarding" examples from the minority class. Class imbalance affects various types of classifiers, including rule-based ones. The difficulties stem from two groups of factors: algorithmic and data-level ones. The algorithmic factors include the following issues. First, most algorithms induce rules using the top-down technique, which hinders finding rules for smaller sets of learning examples, especially from the minority class. This is also connected with using improper evaluation measures to guide the search for the best conditions in an induced rule and for further rule pruning. Secondly, most algorithms use a greedy sequential covering approach, which may increase data fragmentation and result in weaker rules, i.e., rules supported by a small number of learning examples. The "weakness" of the minority class rules influences classification strategies, where minority rules have a smaller chance of contributing to the final classification decision. The other difficulties concern characteristics of imbalanced data distributions. Learning classifiers becomes particularly difficult when other data characteristics occur together with an imbalanced distribution of classes, such as decomposition of the minority class into many rare sub-concepts, too extensive overlapping of decision classes, or the presence of minority class examples inside the majority class regions. In our previous study [2], these data difficulty factors have been associated with different types of examples from the minority class: safe (located in homogeneous regions populated by examples from one class only), borderline, rare cases and outliers.
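The four types of minority examples can be illustrated with a small sketch. The abstract does not say how the types are assigned; the code below assumes that each minority example is labelled from the classes of its k = 5 nearest neighbours, and the function name `label_minority_examples` and the count thresholds are illustrative assumptions rather than the authors' definition.

```python
from math import dist


def label_minority_examples(points, labels, minority, k=5):
    """Label each minority example by the classes of its k nearest neighbours.

    Hypothetical helper: the thresholds below are assumptions chosen for
    k = 5 and are not taken from the abstract.
    """
    result = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Brute-force search over all other examples (fine for a sketch).
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        same = sum(1 for j in neighbours if labels[j] == minority)
        if same >= 4:
            result[i] = "safe"        # neighbourhood is almost homogeneous
        elif same >= 2:
            result[i] = "borderline"  # classes are mixed around the example
        elif same == 1:
            result[i] = "rare"        # one minority neighbour only
        else:
            result[i] = "outlier"     # surrounded by the majority class
    return result
```

The returned dictionary maps the index of each minority example to its assumed type, which is the kind of local information the abstract proposes to feed into rule induction.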
The aim of this study is to present two different, recent algorithms proposed by K. Napierala and J. Stefanowski for inducing classification rules from imbalanced data, and to show the usefulness of studying local data characteristics for two sub-tasks: (1) improving rule classifiers by incorporating types of examples into the induction strategy; (2) applying the analysis of minority class examples to identify differences in the performance of these algorithms and to establish their areas of competence. The BRACID (Bottom-up induction of Rules And Cases for Imbalanced Data) algorithm is constructed following a critical analysis of the limitations of current rule algorithms [3]. Its main features include: the bottom-up generalization of the most specific rules representing single examples in order to overcome the problems of data fragmentation; using twofold rule-based and instance-based

Similar resources

On Mining Fuzzy Classification Rules for Imbalanced Data

The fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it to imbalanced data sets is its bias towards the majority class, such that it performs poorly with respect to the minority class. However, in many cases the minority classes are more important than the majority ones. In this paper, we have extended ...

Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...

Increasing the Interpretability of Rules Induced from Imbalanced Data by Using Bayesian Confirmation Measures

Approaches to support an interpretation of rules induced from imbalanced data are discussed. In this paper, the rule learning algorithm BRACID dedicated to class imbalance is considered. As it may induce too many rules, which hinders their interpretation, their filtering should be applied. We introduce three different post-pruning strategies, which aim at selecting rules having good descriptive...

INDUCING VALUABLE RULES FROM IMBALANCED DATA: THE CASE OF AN IRANIAN BANK EXPORT LOANS

Publication date: 2015